Add DIGEST-MD5 SASL delegation token auth to HiveCatalog#3150
Add DIGEST-MD5 SASL delegation token auth to HiveCatalog#3150ShreyeshArangath wants to merge 5 commits intoapache:mainfrom
Conversation
Enable PyIceberg's HiveCatalog to authenticate using DIGEST-MD5 SASL with delegation tokens from $HADOOP_TOKEN_FILE_LOCATION, which is the standard mechanism in secure Hadoop environments. This unblocks PyIceberg adoption in production clusters that don't use Kerberos directly. - Add HiveAuthError exception for Hive-specific auth failures - Add hadoop_credentials module to parse HDTS binary token files - Add _DigestMD5SaslTransport to work around THRIFT-5926 (None initial response) - Support hive.metastore.authentication property (NONE/KERBEROS/DIGEST-MD5) - Add pure-sasl to hive extras in pyproject.toml - Backward compatible: existing kerberos_auth boolean still works Closes apache#3145 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address all findings from code review: Critical: - Rewrite VInt decoder to match Java WritableUtils.readVLong exactly, using signed-byte interpretation and correct prefix/length semantics High: - Catch OSError (not just FileNotFoundError) when reading token file - Reject unknown auth mechanisms with HiveAuthError instead of silently falling back to unauthenticated TBufferedTransport - Replace monkey-patching sasl.process in _DigestMD5SaslTransport with a clean send_sasl_msg override (thread-safe, no shared state mutation) Medium: - Fix kerberos_service_name default from config key to actual value - Wrap UnicodeDecodeError in HiveAuthError for invalid UTF-8 in tokens - Rewrite VInt test encoder to match real Hadoop encoding format - Fix dead kerberos backward-compat tests to actually exercise __init__ Low: - Add upper bound to pure-sasl dependency (<1.0.0) - Fix tmp_path typing from object to pathlib.Path - Fix docs to say pure-sasl (pip package name) not puresasl Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The parent TSaslClientTransport.send_sasl_msg() has no type annotations, so there is no override incompatibility for mypy to suppress. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
||
|
|
||
| class _DigestMD5SaslTransport(TTransport.TSaslClientTransport): | ||
| """TSaslClientTransport subclass that works around THRIFT-5926. |
There was a problem hiding this comment.
This will be addressed as part of apache/thrift#3342
- Document that hive.kerberos-service-name applies to both KERBEROS and DIGEST-MD5 - Add precedence note for hive.metastore.authentication vs legacy boolean - Add test for empty-string auth mechanism raising HiveAuthError - Add integration test for KERBEROS via hive.metastore.authentication config - Expand HiveAuthError docstring to cover token file errors Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| hive = ["thrift>=0.13.0,<1.0.0"] | ||
| hive = [ | ||
| "thrift>=0.13.0,<1.0.0", | ||
| "pure-sasl>=0.6.0,<1.0.0", |
There was a problem hiding this comment.
I don't think we always need to install this dependency, right?
Fokko
left a comment
There was a problem hiding this comment.
@ShreyeshArangath Thanks for adding this, have you tested this locally?
| | hive.hive2-compatible | true | Using Hive 2.x compatibility mode | | ||
| | hive.kerberos-authentication | true | Using authentication via Kerberos | | ||
| | hive.kerberos-service-name | hive | Kerberos service name (default hive) | | ||
| | hive.kerberos-service-name | hive | SASL service name used for both KERBEROS and DIGEST-MD5 auth (default hive) | |
There was a problem hiding this comment.
Should a new property be created? The name hive.kerberos-service-name no longer captures the full scope.
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
Rationale for this change
Enable PyIceberg's HiveCatalog to authenticate using DIGEST-MD5 SASL with delegation tokens from
$HADOOP_TOKEN_FILE_LOCATION, which is the standard mechanism in secure Hadoop environments. This unblocks PyIceberg adoption in production clusters that don't use Kerberos directlySummary
Closes #3145
Are these changes tested?
Unit tests
Are there any user-facing changes?
Yes, introduce DIGEST-MD5 SASAL delegation token support